@Edwinhr716
Contributor

What does this PR do?

Adds a metric that measures the time spent downloading the model, loading it into GPU memory, and waiting for the server to be ready to receive requests.

Because the router is the component that emits metrics, while the launcher is the one that downloads the model (and is therefore where the download time is tracked), the duration must be passed from the launcher to the router. I considered passing it as a CLI argument, but to minimize the number of changes I opted to use an environment variable instead. Open to suggestions on a better way to pass the value to the router.

To make it easier to work with Rust's Instant type, I opted to measure two values and add them together: the time from starting the model download to launching the router, and the time from launching the router to the router being ready to receive requests.
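The two-part measurement can be sketched roughly as below. This is an illustrative sketch, not the PR's actual code: the variable name MODEL_DOWNLOAD_SECS and the function names are hypothetical, and the launcher/router handoff is simulated by passing a string directly rather than spawning a child process.

```rust
use std::time::{Duration, Instant};

// Launcher side: measure the elapsed download/conversion time and
// serialize it to a string. In the real launcher this value would be
// placed on the spawned router's environment, e.g.
// Command::new("router").env("MODEL_DOWNLOAD_SECS", value).
fn launcher_elapsed() -> String {
    let start = Instant::now();
    // ... download / convert the model here ...
    start.elapsed().as_secs_f64().to_string()
}

// Router side: parse the launcher's portion from the environment value
// and add the router's own elapsed time until it is ready for requests.
fn router_total(download_secs_env: &str) -> Duration {
    let router_start = Instant::now();
    // ... bind the server, warm up, become ready for requests ...
    let download = download_secs_env
        .parse::<f64>()
        .map(Duration::from_secs_f64)
        .unwrap_or_default(); // fall back to zero on a missing/garbled value
    // Total "model load time" = launcher portion + router portion.
    download + router_start.elapsed()
}

fn main() {
    let env_value = launcher_elapsed(); // stands in for the env variable
    let total = router_total(&env_value);
    println!("model load time: {:.3}s", total.as_secs_f64());
}
```

Parsing to `f64` seconds sidesteps the fact that an `Instant` from one process cannot be reconstructed in another, which is presumably why two separate durations are summed rather than one `Instant` being shared.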

This PR is part of the metrics standardization effort.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil
Contributor

Narsil commented Aug 9, 2024

I'm still confused about the value of measuring this one huge metric, which encompasses so many different things:

  • Model download (and from where)
  • Potential model conversion
  • Actual model load time, which can come from disk or from CPU RAM

As a model runner, getting readiness times can be interesting, but without more context it's quite useless, no?
What are users supposed to do with that metric? It's also never modified during the server's lifetime, so it's not really probing the system for monitoring, no?
In our monitoring systems, we only care about the logs, which show insights into what's happening: every step, why it occurs, and how long it takes.

I have nothing against the PR itself, but adding code without clear reasons for clear benefits is always a bit strange.

@Edwinhr716
Contributor Author

What are users supposed to do with that metric ?

Because it is essentially the startup latency of the model, it is useful for determining the pod autoscaling threshold and frequency. For example, if model-load-time is above, say, 40 seconds, the user might not want to scale the number of pods down too often, since it takes a long time to create a new pod once demand rises again.

cc @achandrasekar
